The Polygenic Risk Score Knowledge Base offers a centralized online repository for calculating and contextualizing polygenic risk scores

The process of identifying suitable genome-wide association (GWA) studies and formatting the data to calculate multiple polygenic risk scores on a single genome can be laborious. Here, we present a centralized polygenic risk score calculator currently containing over 250,000 genetic variant associations from the NHGRI-EBI GWAS Catalog, which lets users easily calculate sample-specific polygenic risk scores with results comparable to those of other available tools. Polygenic risk scores are calculated either online through the Polygenic Risk Score Knowledge Base (PRSKB; https://prs.byu.edu) or via a command-line interface. We report study-specific polygenic risk scores across the UK Biobank, 1000 Genomes, and the Alzheimer’s Disease Neuroimaging Initiative (ADNI), contextualize computed scores, and identify potentially confounding genetic risk factors in ADNI. We introduce a streamlined analysis tool and web interface to calculate and contextualize polygenic risk scores across various studies, which we anticipate will facilitate wider adoption of polygenic risk scores in future disease research.


Supplementary Figure 5: GWA study browser interface
Supplementary Figure 5: GWA study browser. The GWA study browser can be found under the "Studies" tab at prs.byu.edu or at "Option 2: Search for a specific study or trait" on the command-line interface menu.

Supplementary Note 1: Linkage disequilibrium clumping
Linkage disequilibrium (LD) clumping files are stored on our server and can be downloaded by specifying the super population and reference genome in the URL. For example, the URL for the European population (EUR) with reference genome hg38 is: https://prs.byu.edu/get_clumps_download_file?refGen=hg38&superPop=EUR
To account for LD in the PRS calculations, we pre-computed the LD regions for each variant in the 1000 Genomes database. We used the PLINK 2 LD Clumping command, which requires reference genotype data to calculate LD between the variants present in a target input file.
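The download URL pattern above can be assembled programmatically. The following is a minimal sketch, not part of the PRSKB tooling: the function name clump_url is hypothetical, and only the hg38/EUR combination is confirmed by the text. Any other parameter values (for example other reference genome builds or super population codes) are assumptions that would need to be checked against the server.

```shell
#!/bin/sh
# Build the LD clump download URL from a reference genome and a super population.
# clump_url is a hypothetical helper; only refGen=hg38 with superPop=EUR is
# taken from the text, other argument values are unverified assumptions.
clump_url() {
    ref_gen="$1"
    super_pop="$2"
    printf 'https://prs.byu.edu/get_clumps_download_file?refGen=%s&superPop=%s\n' \
        "$ref_gen" "$super_pop"
}

# Print the URL for the European population on hg38.
clump_url hg38 EUR
```

The printed URL can then be passed to a download tool of choice; the name and format of the downloaded file are not specified here.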
We used 1000 Genomes data for both the reference and target files. To create the reference files, we downloaded 1000 Genomes variant call format (VCF) files and separated each file by super population (African, American, East Asian, European, and South Asian). Next, we used PLINK to convert each VCF file into a binary file set by running the following command, where ${population} is one of the population-specific VCF files:

PLINK --vcf ${population} --make-bed
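As an illustration of applying this conversion to all five super populations, the sketch below prints one command per population as a dry run rather than executing anything. The executable name plink, the file names (AFR.vcf and so on), the use of the standard 1000 Genomes super population codes as file prefixes, and the --out prefixes are all assumptions made for the example, not details taken from our pipeline.

```shell
#!/bin/sh
# Dry run: print one VCF-to-binary conversion command per super population.
# File names and --out prefixes are hypothetical placeholders.
print_conversion_commands() {
    for pop in AFR AMR EAS EUR SAS; do
        printf 'plink --vcf %s.vcf --make-bed --out %s\n' "$pop" "$pop"
    done
}

print_conversion_commands
```

Printing the commands first makes it easy to inspect them before piping the output to a shell for execution.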
The variants to be grouped into LD regions are listed in ${variant_file}. The PLINK --clump operation scans ${variant_file} and extracts the fields with the headers 'SNP' (reference SNP ID number) and 'P' (p-value). For each population, we created a variant file in which the 'SNP' column contained the variants listed in the corresponding population-filtered VCF file. The 'P' column typically contains the p-value for the association of each SNP with a designated trait, and this value is used to order the variants within each clump by the strength of that association. Because our intent was to identify LD regions for all variants from 1000 Genomes, regardless of their association with any particular trait, we filled every value in this column with '0'. As a result, the final LD clumps are unordered.
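The variant file described above could be produced along the following lines. This is a minimal sketch under stated assumptions: the function name make_variant_file is hypothetical, the input is taken to be an uncompressed, tab-delimited VCF whose third column holds the variant ID, and records whose ID is '.' are skipped (a choice made for this example, not something stated in the text).

```shell
#!/bin/sh
# Write a two-column variant file (SNP, P) from a VCF, with every P set to 0.
# Assumptions: uncompressed tab-delimited VCF, variant ID in column 3,
# records with a missing ID (".") are skipped.
make_variant_file() {
    vcf="$1"
    out="$2"
    {
        printf 'SNP\tP\n'
        # Skip header lines starting with '#', then emit the ID and a constant 0.
        awk -F'\t' '!/^#/ && $3 != "." { print $3 "\t0" }' "$vcf"
    } > "$out"
}

# Run only when an input VCF and an output path are given on the command line.
if [ "$#" -eq 2 ]; then
    make_variant_file "$1" "$2"
fi
```

Because every p-value is identical, the clumping step has no basis for ranking variants, which is consistent with the clumps being unordered.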
For each super population (African, American, East Asian, European, and South Asian), we executed the LD clumping command as follows, where ${reference_data} refers to the population-specific binary file set and ${variant_file} indicates the file with the list of rsIDs.